PAN Data — Biologic Age & Vitamin D

Author

Ishaan Ranjan

Introduction

There is growing evidence that vitamin D supplementation is associated with slower “biologic” aging—the age reflected by molecular biomarkers rather than the number of birthdays on the calendar. In this analysis I use data from the Precision Aging Network (PAN) to examine three established DNA‑methylation–based “epigenetic clocks”—Horvath 2013, AltumAge, and DunedinPACE.

Research Question

Is self‑reported vitamin D supplementation associated with lower (“younger”) values in any three epigenetic clock measures after accounting for chronological age?

Background/Tools

  • Epigenetic clocks: DNA‑methylation signatures that act as molecular clocks. We use three complementary models:

    • Horvath 2013 — 353 CpGs, pan‑tissue; returns an epigenetic age in years. Age acceleration = EpigenAge − ChronAge.
    • AltumAge — ≈21 000 CpGs, deep‑learning model trained on >140 data sets; higher accuracy, interpreted like Horvath.
    • DunedinPACE — 38 CpGs; outputs the pace of aging (1.00 = average, >1 = faster, <1 = slower).
  • Why three clocks? Horvath and AltumAge give a snapshot of cumulative aging, while DunedinPACE acts like a speedometer, telling us how rapidly aging processes are unfolding now. Using all three offers both level and rate perspectives on biological aging.

  • PAN dataset: Community‑dwelling older adults with phenotypic and lifestyle data including supplement use.

Quick Interpretation Cheat Sheet

Clock Output “Higher” value implies
Horvath 2013 years (epigenetic age) Biological age older than calendar age (accelerated aging)
AltumAge years (epigenetic age) Same as Horvath 2013 but with higher precision
DunedinPACE ratio (bio‑years per chrono‑year) Aging faster than average (> 1.0)
  • Key R packages: tidyverse for data wrangling, table1 for publication‑ready descriptive tables, ggplot2 and ggpubr for displaying data.

Creating Analysis Dataset

Below we reproduce the setup and data‑loading chunks exactly as in the original file; the only goal is to annotate what they are doing

Code
# Read PAN dataset and create analysis subset
raw_pan <- read_csv("keys2025data.csv")

# Keep only variables required for analysis, renaming on the fly
pan <- raw_pan %>%
  select(
    hml_id,
    age,
    vitamin_d  = vitd_supplement,   # renamed here
    dose_mg,
    horvath2013,
    altumage,
    dunedinpace,
    sex,
    education = edu_yrs_hml,       # renamed here
    bmi,
    health_medical_cancer
  ) %>%
  drop_na(age, vitamin_d, horvath2013, altumage, dunedinpace)  # updated names used here

pan$cancer<- factor(pan$health_medical_cancer,
                              levels = c(0, 1),
                              labels = c("no", "yes"))



# Save subset for reproducibility
write_csv(pan, "pan_subset.csv")

# Convert from wide to long for optional analyses (wide → long)
pan_long <- pan %>%
  pivot_longer(
    cols = c(horvath2013, altumage, dunedinpace),
    names_to = "clock",
    values_to = "epigenetic_age"
  )

### this code will enable prettier output using the rms package

pand <- datadist(pan)
options(datadist='pand')

Data Preparation

The code chunk below reads the PAN dataset (saved as data2.csv) and drops any participants with missing information on the stratification variable vitd_supplement. This mimics the filtering step in the diabetes tutorial where rows with missing outcome data were excluded.

Descriptive Statistics

Before fitting any models we explore the sample characteristics. The Table 1 below compares participants who do versus do not take a vitamin D supplement, along with an overall column. Continuous variables are summarised with the mean (standard deviation) and median [range]; categorical variables are given as counts (percent).

Table 1 using the table1 package

The next chunk replicates the diabetes example almost verbatim—only the variable list has changed to include the three epigenetic clocks (horvath2013, altumage, dunedinpace) and key covariates (age, sex, highest_education_level_completed, bmi). The argument overall = "Overall" renames the grand‑total column, and render.missing = NULL suppresses the Missing row that would otherwise appear.

Interpretation of Table 1

Table 1 describes the study cohort stratified by vitamin‑D supplementation status (“With” vs. “Without”) and for the overall sample.
What to look for: Compare the means/medians in each column. If vitamin‑D users systematically differ (older age, higher educational attainment, lower BMI) those variables could confound any relationship between vitamin D and the epigenetic clocks.
Why it matters: The more similar the two columns are, the more confident we can be that the analyses isolate the effect of vitamin D rather than underlying demographic or health differences.

For the poster, this should be labeled: Table 1. PAN Demographics and Summary Statistics

Code
# --- Create the descriptive table ------------------------------------------
tab <- table1(
  ~ horvath2013 + altumage + dunedinpace +
    age + sex + education + bmi + cancer | vitamin_d,
  data               = pan,
  overall            = "Overall",
  render.continuous  = "Mean(SD)",
  digits             = 3,
  res                = 300
)

tab 
no
(N= 369)
yes
(N= 59)
Overall
(N= 428)
horvath2013 64.3(7.35) 65.7(6.66) 64.5(7.27)
altumage 60.9(5.81) 62.1(4.75) 61.1(5.69)
dunedinpace 1.01(0.110) 1.04(0.116) 1.01(0.111)
age 64.2(7.70) 66.0(6.71) 64.5(7.59)
sex
Female 238 (64.5%) 54 (91.5%) 292 (68.2%)
Male 131 (35.5%) 5 (8.5%) 136 (31.8%)
education 16.1(2.30) 15.9(2.15) 16.0(2.27)
Missing 159 (43.1%) 12 (20.3%) 171 (40.0%)
bmi 26.9(5.47) 29.1(7.20) 27.2(5.78)
Missing 2 (0.5%) 0 (0%) 2 (0.5%)
cancer
no 338 (91.6%) 53 (89.8%) 391 (91.4%)
yes 31 (8.4%) 6 (10.2%) 37 (8.6%)

As a comparison, we show the vital demographic data, labeled, Table 2. VITAL Study Demographics and Summary Statistics.

A recent study suggested that Vitamin D supplements resulted in a 3 year difference in biologic age (https://www.medicalnewstoday.com/articles/vitamin-d-suppltements-may-slow-biological-aging-preserve-telomere-length)

and another study (DO-HEALTH) showed that 3-year supplementation showed a slight improvement in duninPACE, but the effect was greater in fish oil supplementation than vitamin D (which is at odds with the VITAL study).

Code
# Read in .csv ----------
vital_dat <- read.csv("vital_data_2019.csv")

# Create labeled levels for categorical variables ---------------
vital_dat$vitamin_d <- factor(vital_dat$vitdactive,
                              levels = c(1, 0),
                              labels = c("Yes", "No"))

vital_dat$fish_oil <- factor(vital_dat$fishoilactive,
                             levels = c(1, 0),
                             labels = c("Yes", "No"))

vital_dat$sex_factor <- factor(vital_dat$sex,
                               labels = c("Male", "Female"))

vital_dat$race_factor <- factor(vital_dat$race,
                                labels = c("Non-Hispanic White", "Black", "Hispanic", "Asian", "Native American or Alaskan", "Others/Unknown"))

vital_dat$malcancer_factor <- factor(vital_dat$malca,
                                                    levels = c(1, 0),
                                        labels = c("Malignant Cancer", "No Malignant Cancer"))

vital_dat$majorcvd_factor <- factor(vital_dat$majorcvd,
                                                    levels = c(1, 0),
                               labels = c("Has Major CVD", "No Major CVD"))

# Create nice labels for variables used in table -----------
label(vital_dat$ageyr) <- "Age"
label(vital_dat$sex_factor) <- "Sex"
label(vital_dat$race_factor) <- "Race"
label(vital_dat$bmi) <- "BMI"
label(vital_dat$vitamin_d) <- "Randomization to Active Vitamin D"
label(vital_dat$fish_oil) <- "Randomization to Active N-3 Fatty Acids"
label(vital_dat$malcancer_factor) <- "Cancer"
Code
table1(~ ageyr + sex_factor + race_factor + bmi + malcancer_factor | vitamin_d,
       data = vital_dat, render.missing = NULL, render.continuous="Mean(SD)")
Yes
(N=12927)
No
(N=12944)
Overall
(N=25871)
Age 66.6(7.05) 66.6(7.07) 66.6(7.06)
Sex
Male 6380 (49.4%) 6406 (49.5%) 12786 (49.4%)
Female 6547 (50.6%) 6538 (50.5%) 13085 (50.6%)
Race
Non-Hispanic White 9013 (69.7%) 9033 (69.8%) 18046 (69.8%)
Black 2553 (19.7%) 2553 (19.7%) 5106 (19.7%)
Hispanic 516 (4.0%) 497 (3.8%) 1013 (3.9%)
Asian 188 (1.5%) 200 (1.5%) 388 (1.5%)
Native American or Alaskan 118 (0.9%) 110 (0.8%) 228 (0.9%)
Others/Unknown 259 (2.0%) 264 (2.0%) 523 (2.0%)
BMI 28.1(5.68) 28.1(5.79) 28.1(5.74)
Cancer
Malignant Cancer 793 (6.1%) 824 (6.4%) 1617 (6.3%)
No Malignant Cancer 12134 (93.9%) 12120 (93.6%) 24254 (93.7%)

Statistical Analysis

Methods

We use linear regression to examine the both the relationship between biologic and chronologic age as well as evaluate whether or not there is evidence that vitamin D supplementation slows down biologic aging. We compare these results to two large studies that have come out recently. We don’t see a relationship between Horvath biologic aging and vitamin d supplementation, either as a univariate or multivariable analysis adjusting for other variables (two plots show this, from the RMS modeling).

Results

Output shows that there is a statistically linear relationship between biologic and chronologic aging, with the \(R^2\) of 0.585 showing good predictive capacity (for human cohorts an \(R^2\) of 0.2 is typical for prediction). Interestingly, the Horvath epigenetic clock is higher than chronologic age (by a little over 1 year on average), basically showing the PAN population shows slightly higher biologic age compared to their chronologic age. [[YOU NEED TO DO THIS ANALYSIS for the DunedinPACE as well]]

There are no statistically significant associations between the Horvath epigenetic clock and any baseline phenotypic variables. [you will need to do this for DunedinPACE]

Discussion

Correlation of clocks with chronologic age. The Hovath (a ``first generation” epigenetic clock) has good correlation with chronologic age in the PAN data (\(\rho\) = 0.76), while the DunedinPACE shows low correlation , \(\rho\) =0.16. These results are similar to the DO-HEALTH, PhenoAge r = 0.60, GrimAge r = 0.92, GrimAge2 r = 0.71, DunedinPACE r = 0.19. $\rho$ = r (Pearson correlation coefficient)

However, we don’t see a difference in biologic age at baseline in those taking vitamin D supplements, which is contrary to the VITAL trial, but perhaps similar to the DO-HEALTH study However, one challenge in this analysis using PAN, compared to those studies, is that PAN in a cross-sectional study - where vitamin D supplemention is a variable, where both the VITAL and DO-HEALTH were randomized studies that measured effects after 3-years of follow-up.

Code
vit_d.yes <- subset(pan, vitamin_d =="yes")
vit_d.no <- subset(pan, vitamin_d == "no")

##  linear regression between horvath age and chronologic age -- R2 shows prediction
horvath13.mod <- ols(horvath2013 ~ age, data=pan)
horvath13.mod
Linear Regression Model

ols(formula = horvath2013 ~ age, data = pan)

                Model Likelihood    Discrimination    
                      Ratio Test           Indexes    
Obs     428    LR chi2    376.39    R2       0.585    
sigma4.6862    d.f.            1    R2 adj   0.584    
d.f.    426    Pr(> chi2) 0.0000    g        6.382    

Residuals

     Min       1Q   Median       3Q      Max 
-13.7997  -2.9052  -0.1181   2.7163  30.3684 

          Coef    S.E.   t     Pr(>|t|)
Intercept 17.2580 1.9402  8.89 <0.0001 
age        0.7325 0.0299 24.50 <0.0001 
Code
## correlation 
horvath.cor <- with(pan, cor(horvath2013, age))
horvath.cor
[1] 0.7648368
Code
##  linear regression between dunedinpace and chronologic age
duned.mod <- ols(dunedinpace ~ age, data=pan)
duned.mod
Linear Regression Model

ols(formula = dunedinpace ~ age, data = pan)

                Model Likelihood    Discrimination    
                      Ratio Test           Indexes    
Obs     428    LR chi2     12.29    R2       0.028    
sigma0.1094    d.f.            1    R2 adj   0.026    
d.f.    426    Pr(> chi2) 0.0005    g        0.021    

Residuals

      Min        1Q    Median        3Q       Max 
-0.350807 -0.074780 -0.009595  0.063450  0.344107 

          Coef   S.E.   t     Pr(>|t|)
Intercept 0.8550 0.0453 18.87 <0.0001 
age       0.0025 0.0007  3.52 0.0005  
Code
## correlation with age

duned.cor <- with(pan, cor(dunedinpace, age))
duned.cor
[1] 0.1682299
Code
###  are the slopes different between those taking vit d and those that are not? 

horvath13.mod.vitd.y <-  ols(horvath2013 ~ age, data=vit_d.yes)
horvath13.mod.vitd.n <-  ols(horvath2013 ~ age, data=vit_d.no)

horvath13.mod.vitd.y
Linear Regression Model

ols(formula = horvath2013 ~ age, data = vit_d.yes)

                Model Likelihood    Discrimination    
                      Ratio Test           Indexes    
Obs      59    LR chi2     36.19    R2       0.458    
sigma4.9436    d.f.            1    R2 adj   0.449    
d.f.     57    Pr(> chi2) 0.0000    g        5.184    

Residuals

     Min       1Q   Median       3Q      Max 
-12.3651  -2.8332  -0.1596   2.6650   8.4877 

          Coef    S.E.   t    Pr(>|t|)
Intercept 21.2484 6.4247 3.31 0.0016  
age        0.6725 0.0968 6.95 <0.0001 
Code
horvath13.mod.vitd.n
Linear Regression Model

ols(formula = horvath2013 ~ age, data = vit_d.no)

                Model Likelihood    Discrimination    
                      Ratio Test           Indexes    
Obs     369    LR chi2    338.01    R2       0.600    
sigma4.6547    d.f.            1    R2 adj   0.599    
d.f.    367    Pr(> chi2) 0.0000    g        6.543    

Residuals

     Min       1Q   Median       3Q      Max 
-13.7729  -2.8988  -0.0827   2.7365  30.3316 

          Coef    S.E.   t     Pr(>|t|)
Intercept 16.8007 2.0389  8.24 <0.0001 
age        0.7396 0.0315 23.46 <0.0001 
Code
vitd.y <- predict(horvath13.mod, data=vit_d.yes)
vitd.n <-  predict(horvath13.mod, data=vit_d.no)


# how we show the interaction test in a plot 

# first we create an object from the interaction plot
interaction.plot <- ols(horvath2013 ~ age*vitamin_d, data=pan)
pan$predicted <-predict(interaction.plot)
  
  
  
  ggplot(data=pan, aes(x=age, y=horvath2013, group=vitamin_d, color=vitamin_d)) +
  geom_smooth(method=lm, se=F) +
  geom_point(size=3) +
  scale_color_manual(values=c("steelblue", "gray")) + 
  theme_classic() +
  labs(x="Chronologic Age", 
       y = "Horvath Age") 

Code
  # and we can save this as a .png graph so it will look better on your poster
  
  png("Figure 1. Interaction Horvath versus Chron Age by VitD", res = 300)
  ggplot(data=pan, aes(x=age, y=horvath2013, group=vitamin_d, color=vitamin_d)) +
  geom_smooth(method=lm, se=F) +
  geom_point(size=3) +
  scale_color_manual(values=c("steelblue", "gray")) + 
  theme_classic() +
  labs(x="Chronologic Age", 
       y = "Horvath Age") 
  dev.off()
quartz_off_screen 
                2 
Code
multivariable.mod <- ols(horvath2013 ~ sex + education + bmi + cancer + vitamin_d, data=pan)

plot(anova(multivariable.mod), res=300)

Code
multivariable.mod <- ols(
  horvath2013 ~ sex + education + bmi + cancer + vitamin_d,
  data = pan
)

# Panelled partial-effect plots with a custom y-axis title
plot(
  Predict(multivariable.mod),
  res  = 300,
  ylab = "Horvath Clock"   # <- new Y-axis label
)

Dunedin Pace Analysis

Code
## subset vitamin-D groups
vit_d.yes <- subset(pan, vitamin_d == "yes")
vit_d.no  <- subset(pan, vitamin_d == "no")

## linear regression between DunedinPACE and chronologic age
dunedin.mod <- ols(dunedinpace ~ age, data = pan)
dunedin.mod   # view coefficients & R²
Linear Regression Model

ols(formula = dunedinpace ~ age, data = pan)

                Model Likelihood    Discrimination    
                      Ratio Test           Indexes    
Obs     428    LR chi2     12.29    R2       0.028    
sigma0.1094    d.f.            1    R2 adj   0.026    
d.f.    426    Pr(> chi2) 0.0005    g        0.021    

Residuals

      Min        1Q    Median        3Q       Max 
-0.350807 -0.074780 -0.009595  0.063450  0.344107 

          Coef   S.E.   t     Pr(>|t|)
Intercept 0.8550 0.0453 18.87 <0.0001 
age       0.0025 0.0007  3.52 0.0005  
Code
## correlation with age
dunedin.cor <- with(pan, cor(dunedinpace, age))
dunedin.cor
[1] 0.1682299
Code
### compare slopes in Vit-D users vs non-users
dunedin.mod.vitd.y <- ols(dunedinpace ~ age, data = vit_d.yes)
dunedin.mod.vitd.n <- ols(dunedinpace ~ age, data = vit_d.no)

dunedin.mod.vitd.y
Linear Regression Model

ols(formula = dunedinpace ~ age, data = vit_d.yes)

                Model Likelihood    Discrimination    
                      Ratio Test           Indexes    
Obs      59    LR chi2      0.07    R2       0.001    
sigma0.1170    d.f.            1    R2 adj  -0.016    
d.f.     57    Pr(> chi2) 0.7870    g        0.005    

Residuals

     Min       1Q   Median       3Q      Max 
-0.23828 -0.09014 -0.01276  0.07308  0.26681 

          Coef   S.E.   t    Pr(>|t|)
Intercept 0.9955 0.1521 6.55 <0.0001 
age       0.0006 0.0023 0.27 0.7915  
Code
dunedin.mod.vitd.n
Linear Regression Model

ols(formula = dunedinpace ~ age, data = vit_d.no)

                Model Likelihood    Discrimination    
                      Ratio Test           Indexes    
Obs     369    LR chi2     12.37    R2       0.033    
sigma0.1081    d.f.            1    R2 adj   0.030    
d.f.    367    Pr(> chi2) 0.0004    g        0.023    

Residuals

      Min        1Q    Median        3Q       Max 
-0.348106 -0.072441 -0.007147  0.062607  0.346678 

          Coef   S.E.   t     Pr(>|t|)
Intercept 0.8436 0.0474 17.82 <0.0001 
age       0.0026 0.0007  3.54 0.0005  
Code
## predicted values for plotting
vitd.y <- predict(dunedin.mod, data = vit_d.yes)
vitd.n <- predict(dunedin.mod, data = vit_d.no)

## interaction test (age * vitamin_d)
interaction.plot.dunedin <- ols(dunedinpace ~ age * vitamin_d, data = pan)
pan$predicted_dunedin <- predict(interaction.plot.dunedin)

## quick ggplot
library(ggplot2)

ggplot(data = pan,
       aes(x = age, y = dunedinpace,
           group = vitamin_d, color = vitamin_d)) +
  geom_smooth(method = lm, se = FALSE) +
  geom_point(size = 3) +
  scale_color_manual(values = c("steelblue", "gray")) +
  theme_classic() +
  labs(x = "Chronologic Age",
       y = "DunedinPACE")

Code
## save a high-res PNG for your poster
png("Figure 2. Interaction DunedinPACE versus Chron Age by VitD.png",
    width = 1800, height = 1200, res = 200)

ggplot(data = pan,
       aes(x = age, y = dunedinpace,
           group = vitamin_d, color = vitamin_d)) +
  geom_smooth(method = lm, se = FALSE) +
  geom_point(size = 3) +
  scale_color_manual(values = c("steelblue", "gray")) +
  theme_classic() +
  labs(x = "Chronologic Age",
       y = "DunedinPACE")

dev.off()
quartz_off_screen 
                2 
Code
multivariable.dunedin <- ols(
  dunedinpace ~ sex + education + bmi + cancer + vitamin_d,
  data = pan
)
plot(anova(multivariable.dunedin), res=300)

Code
# Fit model for DunedinPACE
multivariable.dunedin <- ols(
  dunedinpace ~ sex + education + bmi + cancer + vitamin_d,
  data = pan
)

# Panelled partial-effect plots with a custom y-axis title
plot(
  Predict(multivariable.dunedin),
  res  = 300,
  ylab = "DunedinPACE Clock"
)

Scatter plots of Epigenetic Clocks vs Chronologic Age stratified by Vitamin D supplementation

Below we visualize how each epigenetic clock (Horvath 2013, AltumAge, and DunedinPACE) relates to participants’ chronological age.
Each scatter plot is faceted by self‑reported Vitamin D supplement use (“Yes” / “No”), and includes the linear regression equation, the \(R^2\) statistic, and the p‑value for the age–clock association.

Code
df <- pan

plot_df <- df %>%
  mutate(
    vitd_group = case_when(
      !is.na(dose_mg) & dose_mg > 0 ~ "With Vitamin D",
      str_detect(tolower(trimws(vitamin_d)), "^(y(es)?|1|true)$") ~ "With Vitamin D",
      TRUE ~ "Without Vitamin D"
    )
  ) %>%
  filter(!is.na(age), !is.na(horvath2013)) %>%
  mutate(
    vitd_group = factor(vitd_group, levels = c("With Vitamin D", "Without Vitamin D"))
  )

ggplot(plot_df, aes(x = age, y = horvath2013)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  stat_cor(
    aes(label = paste(..rr.label.., ..p.label.., sep = "~','~")),
    r.accuracy = 0.01, p.accuracy = 0.001,
    label.x.npc = "left", label.y.npc = 0.9, size = 4
  ) +
  stat_regline_equation(
    aes(label = ..eq.label..),
    label.x.npc = "left", label.y.npc = 1, size = 4
  ) +
  facet_wrap(~ vitd_group, ncol = 2, scales = "free_x", drop = FALSE) +
  theme_bw() +
  labs(
    title = "Horvath 2013 vs Chronologic Age",
    x = "Chronologic Age (years)",
    y = "Horvath 2013 Epigenetic Age"
  )

Horvath 2013 epigenetic age plotted against chronological age, stratified by Vitamin D supplementation.
Code
df <- pan

plot_df <- df %>%
  mutate(
    vitd_group = case_when(
      !is.na(dose_mg) & dose_mg > 0 ~ "With Vitamin D",
      str_detect(tolower(trimws(vitamin_d)), "^(y(es)?|1|true)$") ~ "With Vitamin D",
      TRUE ~ "Without Vitamin D"
    )
  ) %>%
  filter(!is.na(age), !is.na(altumage)) %>%
  mutate(
    vitd_group = factor(vitd_group, levels = c("With Vitamin D", "Without Vitamin D"))
  )

ggplot(plot_df, aes(x = age, y = altumage)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  stat_cor(
    aes(label = paste(..rr.label.., ..p.label.., sep = "~','~")),
    r.accuracy = 0.01, p.accuracy = 0.001,
    label.x.npc = "left", label.y.npc = 0.9, size = 4
  ) +
  stat_regline_equation(
    aes(label = ..eq.label..),
    label.x.npc = "left", label.y.npc = 1, size = 4
  ) +
  facet_wrap(~ vitd_group, ncol = 2, scales = "free_x", drop = FALSE) +
  theme_bw() +
  labs(
    title = "AltumAge vs Chronologic Age",
    x = "Chronologic Age (years)",
    y = "AltumAge Epigenetic Age"
  )

AltumAge epigenetic age plotted against chronological age, stratified by Vitamin D supplementation.
Code
plot_df <- pan %>%
  mutate(
    vitd_group = case_when(
      !is.na(dose_mg) & dose_mg > 0 ~ "With Vitamin D",
      str_detect(tolower(as.character(vitamin_d)), "^(y(es)?|1|true)$") ~ "With Vitamin D",
      TRUE ~ "Without Vitamin D"
    ),
    log_dunedinpace = log(dunedinpace)
  ) %>%
  filter(!is.na(age), !is.na(log_dunedinpace)) %>%
  mutate(
    vitd_group = factor(vitd_group, levels = c("With Vitamin D", "Without Vitamin D"))
  )

ylim_range.notlog <- range(plot_df$dunedinpace, na.rm = TRUE)
ggplot(plot_df, aes(x = log(age), y = dunedinpace)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  facet_wrap(~ vitd_group, ncol = 2, scales = "free_x", drop = FALSE) +
  coord_cartesian(ylim = ylim_range.notlog) +
  theme_bw() +
  labs(
    title = "DunedinPACE vs log(Chronologic Age)",
    x = "Chronologic Age (log(years))",
    y = "DunedinPACE"
  )

DunedinPACE epigenetic age plotted against chronological age, stratified by Vitamin D supplementation.

Preparing Dataset #1 for Week‑3 Statistical Analyses

Next week we’ll move forward with some statistical analyses. We’ll need a streamlined dataset that keeps only the two epigenetic clocks plus the key covariates (Age, Sex, BMI, vitamin‑D status, and Cancer history).
If the log‑transformed versions of the clocks prove to fit better (see today’s plots) we’ll carry those forward in analysis as well. version i think - Dataset 1

Code
# 1) Load the working dataframe ----------------------------------------------
pan_subset <- read_csv("pan_subset.csv")   # or "data2.csv" if that’s your source

# 2) Keep only the variables we need and reshape ------------------------------
dataset1 <- pan_subset %>%
  select(
    horvath2013, altumage,        # epigenetic clocks
    age, sex, bmi,                # covariates
    vitamin_d,
    health_medical_cancer, education
  ) %>%
  pivot_longer(                   # long format: one row per clock per person
    cols      = c(horvath2013, altumage),
    names_to  = "clock",
    values_to = "clock_value"
  ) %>%
  mutate(
    vitamin_d = factor(
      vitamin_d,
      levels = c("no", "yes"),
      labels = c("Without Vit D", "With Vit D")
    ),
    health_medical_cancer = factor(
      health_medical_cancer,
      levels = c(0, 1),
      labels = c("No", "Yes")
    )
  )

# 3) Save for modelling -------------------------------------------------------
write_csv(dataset1, "dataset1_clocks_long.csv")

DataSet2

Code
# Start from the long file created for Dataset #1
data.graph <- read_csv("dataset1_clocks_long.csv") %>%
  rename(epiage = clock_value)   # epiage = clock value

# Peek at the result (optional)
# glimpse(data.graph)

# Save a copy in case you need it elsewhere
write_csv(data.graph, "data_graph_long_epiage.csv")

Code
library(ggplot2)

ggplot(data.graph, aes(x = age, y = epiage)) +
  geom_point(alpha = 0.6) +
  stat_smooth(method = "lm", se = FALSE, color = "black") +
  facet_grid(clock ~ vitamin_d) +
  labs(
    x = "Chronological Age (years)",
    y = "Epigenetic Age (years)"
  ) +
  theme_bw()

Epigenetic age vs chronological age

Reading the figure

A slope of ≈ 1 means biological age increases roughly one‑for‑one with calendar age (average ageing pace).
If the slope in the With vitamin‑D panel is < the slope in the Without* panel*, that suggests vitamin D users are ageing more slowly per year.
A
lower intercept** in the vitamin‑D group implies they start life (or the modelling baseline) with a “younger‑looking” epigenome.

Interpretation for the vitamin‑D question

Taken together, a flatter slope and/or lower intercept in vitamin‑D users indicates slower or delayed biological ageing relative to non‑users—consistent with the hypothesis that vitamin D supplementation is beneficial for epigenetic ageing. Conversely, overlapping slopes and intercepts would argue against a strong vitamin‑D effect.

Epigenetic age vs BMI

Code
ggplot(data.graph, aes(x = bmi, y = epiage)) +
  geom_point(alpha = 0.6) +
  stat_smooth(method = "lm", se = FALSE, color = "black") +
  facet_grid(clock ~ vitamin_d) +
  labs(
    x = "BMI",
    y = "Epigenetic Age (years)"
  ) +
  theme_bw()

Epigenetic age vs BMI

Reading the figure

  • Positive slopes mean that higher BMI is associated with higher (older) epigenetic age. A less positive or negative slope among vitamin‑D users would suggest that supplementation mitigates the detrimental effect of obsesity on biological ageing.

  • Represents the expected epigenetic age for someone with BMI = 0 (a hypothetical), but differences between intercepts still reflect baseline biological‑age offsets between groups.

Interpretation for the vitamin‑D question

If vitamin‑D users show a shallower slope—even after being stratified by supplementation—this would imply vitamin D dampens the BMI‑related acceleration of epigenetic ageing.

Epigenetic age by sex

Code
ggplot(data.graph, aes(x = sex, y = epiage)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  facet_grid(clock ~ vitamin_d) +
  labs(
    x = "Sex",
    y = "Epigenetic Age (years)"
  ) +
  theme_bw()

Epigenetic age by sex

Reading the figure

Box‑and‑whisker plots display the distribution of epigenetic age for males vs. females.

  • Median line. The central line is the median; compare medians between sexes and vitamin‑D groups.
  • Box & whiskers. Wider boxes signal more variability.

Interpretation for the vitamin‑D question

If sex differences (e.g., males older than females) shrink among vitamin‑D users, it suggests supplementation may resist the disadvantageous sex gap in biological ageing. This seems true as the sexes show an almost opposite trend with the vitamin D supplementation where in the altumage clock with VitD the males a have a lower epigenetic age and in the without VitD in both clocks females haev the lwoer epigenetic age.

Epigenetic age by cancer history

Code
ggplot(data.graph, aes(x = health_medical_cancer, y = epiage)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  facet_grid(clock ~ vitamin_d) +
  labs(
    x = "Cancer History",
    y = "Epigenetic Age (years)"
  ) +
  theme_bw()

Epigenetic age by cancer history

Reading the figure

Box‑and‑whisker plots contrast epigenetic ages between participants with vs. without a history of cancer.

Interpretation for the vitamin‑D question

Vitamin D could plausibly influence cancer pathways and ageing. Observe whether:

  • The difference narrows in the supplemented group → consistent with vitamin D mitigating cancer‑associated age acceleration. It seems very minimal but the cancer history individuals are closer to the individuals without a cancer history.
  • If differences remain similar, vitamin D may not offset cancer‑related ageing.

Epigenetic age by vitamin-D status

Code
ggplot(data.graph, aes(x = vitamin_d, y = epiage)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  facet_wrap(~ clock, ncol = 2) +
  labs(
    x = "Vitamin-D Supplementation",
    y = "Epigenetic Age (years)"
  ) +
  theme_bw()

Epigenetic age by vitamin-D status (x-axis)

Reading the figure

This is the “headline” comparison: epigenetic age by vitamin‑D status across the three clocks.

  • Vertical axis. Years (Horvath & AltumAge) or ageing pace ratio (DunedinPACE).
  • Box position. Lower median or entire box in the “With vitamin‑D” column implies younger/healthier ageing.

Overall interpretation

  • Lower medians and tighter boxes in vitamin‑D users support the hypothesis that supplementation slows or dampens epigenetic ageing.
  • Overlapping boxes (medians and IQRs) would suggest little to no measurable effect.
Code
regression <-lm(dunedinpace ~ log(age)*vitd_group, data=plot_df)
anova(regression)
Analysis of Variance Table

Response: dunedinpace
                     Df Sum Sq  Mean Sq F value    Pr(>F)    
log(age)              1 0.1568 0.156784 13.1323 0.0003255 ***
vitd_group            1 0.0225 0.022523  1.8865 0.1703181    
log(age):vitd_group   1 0.0085 0.008524  0.7140 0.3985943    
Residuals           424 5.0621 0.011939                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
regression.bmi <-lm(horvath2013 ~ bmi*vitd_group, data=plot_df)
anova(regression.bmi)
Analysis of Variance Table

Response: horvath2013
                Df  Sum Sq Mean Sq F value Pr(>F)
bmi              1    80.6  80.593  1.5270 0.2173
vitd_group       1   122.8 122.751  2.3257 0.1280
bmi:vitd_group   1    47.7  47.715  0.9040 0.3422
Residuals      422 22272.9  52.779               
Code
regression.sex <-lm(altumage ~ sex*vitd_group, data=plot_df)
anova(regression.sex)
Analysis of Variance Table

Response: altumage
                Df  Sum Sq Mean Sq F value  Pr(>F)  
sex              1   172.0 171.952  5.4276 0.02029 *
vitd_group       1   125.9 125.881  3.9734 0.04687 *
sex:vitd_group   1    70.3  70.259  2.2177 0.13718  
Residuals      424 13432.8  31.681                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
regression.cancer <-lm(horvath2013 ~ health_medical_cancer*vitd_group, data=plot_df)
anova(regression.cancer)
Analysis of Variance Table

Response: horvath2013
                                  Df  Sum Sq Mean Sq F value   Pr(>F)   
health_medical_cancer              1   432.8  432.84  8.3499 0.004055 **
vitd_group                         1    86.7   86.72  1.6730 0.196563   
health_medical_cancer:vitd_group   1    42.9   42.89  0.8275 0.363521   
Residuals                        424 21978.9   51.84                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The following code is a linear regression that tests the hypothesis:  are the regression lines different for those participants who take vitamin D supplement versus those that don’t

Code
library(rms)
regression.age <- ols(horvath2013 ~ age*vitd_group, data=plot_df)
anova(regression.age)
                Analysis of Variance          Response: horvath2013 

 Factor                                          d.f. Partial SS  MS         
 age  (Factor+Higher Order Factors)                2  13101.38928 6550.694640
  All Interactions                                 1     10.48766   10.487664
 vitd_group  (Factor+Higher Order Factors)         2     10.53266    5.266332
  All Interactions                                 1     10.48766   10.487664
 age * vitd_group  (Factor+Higher Order Factors)   1     10.48766   10.487664
 REGRESSION                                        3  13196.67907 4398.893023
 ERROR                                           424   9344.69488   22.039375
 F      P     
 297.23 <.0001
   0.48 0.4907
   0.24 0.7876
   0.48 0.4907
   0.48 0.4907
 199.59 <.0001
              
Code
regression.age
Linear Regression Model

ols(formula = horvath2013 ~ age * vitd_group, data = plot_df)

                Model Likelihood    Discrimination    
                      Ratio Test           Indexes    
Obs     428    LR chi2    376.87    R2       0.585    
sigma4.6946    d.f.            3    R2 adj   0.583    
d.f.    424    Pr(> chi2) 0.0000    g        6.383    

Residuals

      Min        1Q    Median        3Q       Max 
-13.77295  -2.85638  -0.09115   2.74337  30.33161 

                                   Coef    S.E.   t     Pr(>|t|)
Intercept                          21.2484 6.1011  3.48 0.0005  
age                                 0.6725 0.0919  7.32 <0.0001 
vitd_group=Without Vitamin D       -4.4477 6.4383 -0.69 0.4901  
age * vitd_group=Without Vitamin D  0.0671 0.0973  0.69 0.4907  

with respect to the first question:  are the slopes different?  The p-value for the interaction (age*vitd is the interaction) – there is no statistically significant difference in slopes (p=0.49)

But, there is evidence that age and vitd are predictive of “biologic age” as evidenced by an R2 of 0.5  (the cutoff for observational studies is 0.2).  but we need to removed the interaction term and write the model clock ~ age + vitd

Code
regression.age.noint <- ols(horvath2013 ~ bmi+sex+education+health_medical_cancer+vitamin_d, data=pan)
anova(regression.age.noint)
                Analysis of Variance          Response: horvath2013 

 Factor                d.f. Partial SS   MS          F     P     
 bmi                     1  3.544416e+00   3.5444163  0.07 0.7951
 sex                     1  5.473690e+02 547.3689603 10.44 0.0014
 education               1  6.148916e-01   0.6148916  0.01 0.9138
 health_medical_cancer   1  8.068380e+01  80.6838016  1.54 0.2159
 vitamin_d               1  1.564840e+02 156.4839979  2.98 0.0853
 REGRESSION              5  6.872005e+02 137.4400931  2.62 0.0248
 ERROR                 250  1.310725e+04  52.4289966             
Code
regression.age.noint
Frequencies of Missing Values Due to Each Variable
          horvath2013                   bmi                   sex 
                    0                     2                     0 
            education health_medical_cancer             vitamin_d 
                  171                     0                     0 

Linear Regression Model

ols(formula = horvath2013 ~ bmi + sex + education + health_medical_cancer + 
    vitamin_d, data = pan)

                Model Likelihood    Discrimination    
                      Ratio Test           Indexes    
Obs     256    LR chi2     13.08    R2       0.050    
sigma7.2408    d.f.            5    R2 adj   0.031    
d.f.    250    Pr(> chi2) 0.0226    g        1.796    

Residuals

       Min         1Q     Median         3Q        Max 
-14.615616  -4.898564  -0.006549   4.515122  32.992156 

                      Coef    S.E.   t     Pr(>|t|)
Intercept             63.5517 4.3184 14.72 <0.0001 
bmi                   -0.0206 0.0791 -0.26 0.7951  
sex=Male               3.2862 1.0170  3.23 0.0014  
education             -0.0227 0.2097 -0.11 0.9138  
health_medical_cancer  2.2125 1.7835  1.24 0.2159  
vitamin_d=yes          2.1348 1.2357  1.73 0.0853  
Code
plot(anova(regression.age.noint))

Code
plot(predict(regression.age.noint))